
Conversation

@simonrosenberg (Collaborator) commented on Jan 28, 2026

Summary

This PR aligns the default argument values in the benchmarks repository with the values used in the evaluation repository (OpenHands/evaluation).

Changes

Global defaults in args_parser.py:

These defaults are the same across all benchmarks and are set directly in args_parser.py:

| Argument | Default | Reason |
| --- | --- | --- |
| `--workspace` | `remote` | Production uses remote workspaces |
| `--max-iterations` | `500` | Sufficient for complex tasks |
| `--critic` | `finish_with_patch` | Ensures the agent produces valid patches |
| `--output-dir` | `./eval_outputs` | Standard output directory |
| `--n-limit` | `0` | No limit by default |
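
For orientation, here is a minimal sketch of how such shared defaults could be declared on the common parser. The flag names and values come from the table above; the `get_parser` name and the help strings are illustrative assumptions, not the actual PR diff:

```python
# Sketch only -- shared defaults on a common parser (as in benchmarks/utils/args_parser.py);
# helper name and help strings are assumptions based on the table above.
import argparse


def get_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Shared benchmark arguments")
    parser.add_argument("--workspace", default="remote", help="Production uses remote workspaces")
    parser.add_argument("--max-iterations", type=int, default=500, help="Iteration budget for complex tasks")
    parser.add_argument("--critic", default="finish_with_patch", help="Critic that checks the final patch")
    parser.add_argument("--output-dir", default="./eval_outputs", help="Standard output directory")
    parser.add_argument("--n-limit", type=int, default=0, help="0 means no instance limit")
    return parser
```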

Benchmark-specific defaults via config.py and parser.set_defaults():

Each benchmark has a config.py file with INFER_DEFAULTS (and optionally EVAL_DEFAULTS) that are applied via parser.set_defaults():

| Benchmark | INFER_DEFAULTS |
| --- | --- |
| commit0 | dataset, split, repo_split, num_workers=8, max_attempts=1, max_retries=1 |
| gaia | dataset, split=validation, num_workers=30, max_attempts=3 |
| swebench | dataset, split=test, num_workers=30, max_attempts=3, max_retries=3 |
| swebenchmultimodal | dataset, split=dev, num_workers=30, max_attempts=3, max_retries=3 |
| swtbench | dataset (SWT-bench), split=test, num_workers=30, max_attempts=3, max_retries=3 |

Evaluation defaults (EVAL_DEFAULTS):

| Benchmark | EVAL_DEFAULTS |
| --- | --- |
| swebench | dataset, workers=12 |
| swebenchmultimodal | dataset, split=dev, workers=12 |
| swtbench | dataset (SWE-bench), split=test, workers=24 |

Note: swtbench uses different datasets for inference (SWT-bench) vs evaluation (SWE-bench).
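
To illustrate the pattern, here is a minimal sketch of what a benchmark config and its application could look like, assembled from the SWE-bench values in the tables above and below; the module path, the exact key sets, and the parser helper name are assumptions rather than the literal PR diff:

```python
# Sketch only -- benchmarks/swebench/config.py as implied by the tables in this PR.
INFER_DEFAULTS = {
    "dataset": "princeton-nlp/SWE-bench_Verified",
    "split": "test",
    "num_workers": 30,
    "max_attempts": 3,
    "max_retries": 3,
}

EVAL_DEFAULTS = {
    "dataset": "princeton-nlp/SWE-bench_Verified",
    "workers": 12,
}

# In run_infer.py / eval_infer.py the benchmark overrides would be applied on top
# of the shared parser before parsing, roughly like:
#
#     parser = get_parser()                  # shared parser from args_parser.py (name assumed)
#     parser.set_defaults(**INFER_DEFAULTS)  # benchmark-specific defaults win over globals
#     args = parser.parse_args()
```

Because argparse parser-level defaults set via `set_defaults()` take precedence over argument-level defaults, each benchmark's config.py can override the shared values without redefining any arguments.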

Benefits

  • Consistency: Running benchmarks locally now uses the same defaults as production
  • Maintainability: Clear separation between global defaults (args_parser.py) and benchmark-specific defaults (config.py)
  • Single source of truth: Each benchmark's config.py is the authoritative source for its defaults

Testing

  • All modified files pass pre-commit checks (ruff format, ruff lint, pycodestyle, pyright)
  • No functional changes to evaluation logic, only default values

Update args_parser.py and benchmark-specific run_infer.py files to use
default values that match the evaluation repository (OpenHands/evaluation)
eval-job/values.yaml configuration.

Shared defaults updated in args_parser.py:
- workspace: 'docker' -> 'remote'
- max-iterations: 100 -> 500
- critic: 'pass' -> 'finish_with_patch'

Benchmark-specific overrides using parser.set_defaults():
- gaia: dataset='gaia-benchmark/GAIA'
- swtbench: dataset='eth-sri/SWT-bench_Verified_bm25_27k_zsp'
- commit0: max_attempts=1, max_retries=1 (in addition to existing dataset)

Also updated AGENTS.md to document the default values alignment pattern.

Co-authored-by: openhands <openhands@all-hands.dev>
…hmultimodal

- swebench: Add explicit set_defaults(dataset, split) for consistency with
  other benchmarks, even though values match global defaults
- swebenchmultimodal: Update comment to match the pattern used in other benchmarks

Co-authored-by: openhands <openhands@all-hands.dev>
Each benchmark now sets its own dataset default via set_defaults(),
so no global default is needed.

Co-authored-by: openhands <openhands@all-hands.dev>
All benchmarks in the evaluation repository use .llm_config/runtime.json
as the LLM config path, so use this as the default.

Co-authored-by: openhands <openhands@all-hands.dev>
- Created config.py with INFER_DEFAULTS and EVAL_DEFAULTS for each benchmark
- Removed all defaults from utils/args_parser.py
- Removed default from critics.py
- Updated swebench, gaia, swtbench run_infer.py to use INFER_DEFAULTS
- Started commit0 update (import added)

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai (bot) commented on Jan 29, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Pre-commit checks

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #377 at branch `align-default-values-with-evaluation`

Feel free to include any additional details that might help me get this PR into a better state.


@simonrosenberg (Collaborator, Author) commented:

@OpenHands fix precommit issues

@openhands-ai (bot) commented on Jan 29, 2026

I'm on it! simonrosenberg can track my progress at all-hands.dev

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai (bot) commented on Jan 29, 2026

I've fixed the pre-commit issues and pushed the changes to the PR branch.

Summary of Changes

The pre-commit checks were failing due to ruff lint errors related to import ordering. The fixes were:

  1. benchmarks/commit0/run_infer.py: Removed an unused import (from benchmarks.commit0.config import INFER_DEFAULTS)

  2. benchmarks/swebench/run_infer.py: Reordered imports to follow proper alphabetical ordering (moved from benchmarks.swebench.config import INFER_DEFAULTS after the build_images import block)

Verification

  • ✅ All pre-commit checks now pass (Ruff format, Ruff lint, PEP8 style check, Pyright type check)
  • ✅ Changes are minimal and only address the linting issues
  • ✅ Changes have been committed and pushed to the align-default-values-with-evaluation branch

The PR should now pass the pre-commit CI checks.


Align EVAL_DEFAULTS with NUM_EVAL_WORKERS from evaluation repository values.yaml:
- GAIA: workers=1
- Commit0: workers=1

Co-authored-by: openhands <openhands@all-hands.dev>
Update eval_infer.py files to import and use EVAL_DEFAULTS from their
respective config.py files via parser.set_defaults():
- swebench/eval_infer.py: uses EVAL_DEFAULTS for dataset, model_name, workers
- swtbench/eval_infer.py: uses EVAL_DEFAULTS for dataset, model_name, workers
- swebenchmultimodal/eval_infer.py: uses EVAL_DEFAULTS for dataset, split, model_name, workers

This ensures the default values defined in config.py are actually used
by the evaluation scripts, aligning with the pattern used in run_infer.py
files for INFER_DEFAULTS.

Co-authored-by: openhands <openhands@all-hands.dev>
…infer.py

Update run_infer.py files to import and use INFER_DEFAULTS from their
respective config.py files via parser.set_defaults():
- commit0/run_infer.py: uses INFER_DEFAULTS for all inference settings
- swebenchmultimodal/run_infer.py: uses INFER_DEFAULTS for all inference settings

This ensures the default values defined in config.py are actually used
by the inference scripts, completing the alignment with the evaluation
repository values.yaml.

Co-authored-by: openhands <openhands@all-hands.dev>
Update eval_infer.py files to import and use EVAL_DEFAULTS from their
respective config.py files via parser.set_defaults():
- commit0/eval_infer.py: uses EVAL_DEFAULTS for model_name
- gaia/eval_infer.py: uses EVAL_DEFAULTS for model_name

This ensures all benchmarks consistently use their config.py defaults.

Co-authored-by: openhands <openhands@all-hands.dev>
These fields are not benchmark-specific and should have global defaults:
- note: 'initial' (user-facing option for run identification)
- n_limit: 0 (no limit by default)
- output_dir: OUTPUT_DIR from constants.py ('./eval_outputs')

Added OUTPUT_DIR constant to benchmarks/utils/constants.py.

This keeps INFER_DEFAULTS focused on benchmark-specific values from
the evaluation repository's values.yaml.

Co-authored-by: openhands <openhands@all-hands.dev>
- gaia: Remove max_retries from INFER_DEFAULTS (not used in run_infer.py)
- gaia: Remove workers from EVAL_DEFAULTS (not used in eval_infer.py)
- commit0: Remove workers from EVAL_DEFAULTS (not used in eval_infer.py)

Each config now only contains fields that are actually used by the
corresponding run_infer.py and eval_infer.py scripts.

Co-authored-by: openhands <openhands@all-hands.dev>
Remove the default value 'initial' from --note argument. When not
specified, no note identifier is appended to the output directory.

The construct_eval_output_dir function already handles None/empty
values gracefully by not appending the _N_ suffix.

Co-authored-by: openhands <openhands@all-hands.dev>
Replace hardcoded dataset, split, and repo_split values with references
to INFER_DEFAULTS in:
- commit0/run_infer.py: Commit0Evaluation class __init__ and prepare_instances
- commit0/build_images.py: set only the specific defaults needed (dataset, split, repo_split)

This ensures all commit0 code uses the centralized config values.

Co-authored-by: openhands <openhands@all-hands.dev>
@simonrosenberg force-pushed the align-default-values-with-evaluation branch from 98a6a58 to e53928f on January 29, 2026 at 10:21
The commit0 eval_infer.py is a simple JSON processor that doesn't need
centralized defaults. Reverted to main version.

Co-authored-by: openhands <openhands@all-hands.dev>
The gaia eval_infer.py is a simple JSON processor that doesn't need
centralized defaults. Reverted to main version.

Co-authored-by: openhands <openhands@all-hands.dev>
Import DEFAULT_DATASET, DEFAULT_CLI_MODEL_NAME, DEFAULT_EVAL_WORKERS from
constants.py instead of duplicating the values. This ensures constants.py
remains the single source of truth for these values.

Co-authored-by: openhands <openhands@all-hands.dev>
… config.py

Remove these constants from constants.py and update eval_infer.py to use
EVAL_DEFAULTS from config.py instead. config.py is now the single source
of truth for dataset, model_name, and workers defaults.

Co-authored-by: openhands <openhands@all-hands.dev>
…VAL_DEFAULTS

model_name is specific to the CLI and should stay in constants.py.
EVAL_DEFAULTS now only contains dataset and workers.

Co-authored-by: openhands <openhands@all-hands.dev>
Revert eval_infer.py files to main and remove model_name from EVAL_DEFAULTS.
The model_name is hardcoded in the eval_infer.py files.

Co-authored-by: openhands <openhands@all-hands.dev>
…nd swtbench eval_infer

Import EVAL_DEFAULTS and use parser.set_defaults() to apply them.
model_name remains hardcoded in the argument parser.

Co-authored-by: openhands <openhands@all-hands.dev>
…d_eval_env_images

Update image_utils.py, build_eval_env_images.py, and eval_infer.py to import
and use INFER_DEFAULTS instead of hardcoding dataset and split values.

Co-authored-by: openhands <openhands@all-hands.dev>
…EVAL_DEFAULTS

image_utils.py and build_eval_env_images.py are used for evaluation, so they
should use EVAL_DEFAULTS (princeton-nlp/SWE-bench_Verified) not INFER_DEFAULTS
(eth-sri/SWT-bench_Verified_bm25_27k_zsp).

Added split='test' to EVAL_DEFAULTS to match values.yaml.

Co-authored-by: openhands <openhands@all-hands.dev>
Revert AGENTS.md to main version.
Restore original docstring example in build_images.py.

Co-authored-by: openhands <openhands@all-hands.dev>
Set workspace default='remote' in args_parser.py since it's the same for all
benchmarks. Remove workspace from all INFER_DEFAULTS in config.py files.

Co-authored-by: openhands <openhands@all-hands.dev>
…ULTS

Set max_iterations default=500 in args_parser.py since it's the same for all
benchmarks. Remove max_iterations from all INFER_DEFAULTS in config.py files.

Co-authored-by: openhands <openhands@all-hands.dev>
Set critic default='finish_with_patch' in critics.py since it's the same for
all benchmarks. Remove critic from all INFER_DEFAULTS in config.py files.

Co-authored-by: openhands <openhands@all-hands.dev>
Revert benchmarks/utils/constants.py to main version.
Hardcode './eval_outputs' as default for --output-dir in args_parser.py.

Co-authored-by: openhands <openhands@all-hands.dev>
@simonrosenberg marked this pull request as ready for review on January 29, 2026 at 11:32
@simonrosenberg self-assigned this on Jan 29, 2026
@simonrosenberg (Collaborator, Author) commented:

@OpenHands
Read this PR diff
This PR aligns the default argument values in the benchmarks repository with the values used in the evaluation repository (OpenHands/evaluation).

Then clone the evaluation repository and simplify its code as much as possible. Now that the correct values are defined in the benchmarks repository, they can be removed from the evaluation repository entirely.
Try to simplify and clean up the evaluation repo code and open a PR with your changes.

@openhands-ai (bot) commented on Jan 29, 2026

I'm on it! simonrosenberg can track my progress at all-hands.dev

Add level to GAIA INFER_DEFAULTS matching production configuration.
Make --level argument optional since it now has a default.

Co-authored-by: openhands <openhands@all-hands.dev>
@simonrosenberg (Collaborator, Author) commented:

@OpenHands
/codereview-roasted
Clone the evaluation repository.
Check out the values in eval-job/values.yaml.
The purpose of this PR is to move all those hyperparameters from the evaluation repo to the benchmarks repository.

Please validate that

  1. no hyperparameter has changed values / no behavioral changes are introduced
  2. the logic of this PR is sound

@openhands-ai (bot) commented on Jan 29, 2026

I'm on it! simonrosenberg can track my progress at all-hands.dev

@openhands-ai (bot) commented on Jan 29, 2026

Summary

I've completed the task of simplifying the OpenHands/evaluation repository to leverage the default values now defined in the OpenHands/benchmarks repository.

Changes Made

Created PR: https://github.com/OpenHands/evaluation/pull/223 (draft)

Files Modified:

  1. eval-job/values.yaml - Removed redundant inference/evaluation defaults:

    • Removed: WORKSPACE_TYPE, CRITIC_NAME, NUM_INFER_WORKERS, MAX_ITERATIONS, MAX_ATTEMPTS, MAX_RETRIES, OUTPUT_DIR
    • Kept: DATASET, DATASET_SPLIT, REPO_SPLIT, DATASET_INFER, DATASET_EVAL (needed for build workflow dispatch)
  2. run_swebench.sh - Simplified to only pass run-specific arguments (EVAL_LIMIT, MODEL_SLUG, MODEL_NAME)

  3. run_gaia.sh - Simplified to only pass run-specific and GAIA-specific arguments (GAIA_LEVEL, GAIA_SPLIT, EVAL_LIMIT, MODEL_SLUG)

  4. run_commit0.sh - Simplified to only pass run-specific arguments

  5. run_swtbench.sh - Simplified to only pass run-specific arguments

  6. run_swebenchmultimodal.sh - Simplified to only pass run-specific arguments

  7. run_eval.sh - Removed benchmark-specific environment variable requirements that are now handled by benchmarks defaults

Benefits

  • Reduced duplication: Default values are now defined in one place (benchmarks repo)
  • Simpler scripts: Shell scripts are shorter and easier to understand (~70 lines removed)
  • Easier maintenance: Changes to defaults only need to be made in the benchmarks repo
  • Consistency: Running benchmarks locally uses the same defaults as production

Dependency

This PR depends on OpenHands/benchmarks PR #377 which aligns the default argument values in the benchmarks repository. The evaluation PR should be merged after the benchmarks PR is merged.


@openhands-ai (bot) commented on Jan 29, 2026

Code Review: PR #377 - Align Default Argument Values with Evaluation Repository

Taste Rating: 🟡 Acceptable - Works but has one critical gap


Hyperparameter Validation Summary

I compared the PR's config.py files against the evaluation repository's eval-job/values.yaml. Here's the validation:

✅ Values That Match Correctly

| Benchmark | Parameter | PR Value | Eval Repo Value |
| --- | --- | --- | --- |
| Global | workspace | remote | WORKSPACE_TYPE: remote |
| Global | max-iterations | 500 | MAX_ITERATIONS: 500 |
| Global | critic | finish_with_patch | CRITIC_NAME: finish_with_patch |
| Global | output-dir | ./eval_outputs | OUTPUT_DIR: ./eval_outputs |
| SWE-bench | dataset | princeton-nlp/SWE-bench_Verified | |
| SWE-bench | split | test | |
| SWE-bench | num_workers | 30 | |
| SWE-bench | max_attempts | 3 | |
| SWE-bench | max_retries | 3 | |
| SWE-bench | eval workers | 12 | |
| SWT-bench | dataset (infer) | eth-sri/SWT-bench_Verified_bm25_27k_zsp | |
| SWT-bench | dataset (eval) | princeton-nlp/SWE-bench_Verified | |
| SWT-bench | eval workers | 24 | |
| Commit0 | dataset | wentingzhao/commit0_combined | |
| Commit0 | repo_split | lite | |
| Commit0 | num_workers | 8 | |
| Commit0 | max_attempts | 1 | |
| Commit0 | max_retries | 1 | |
| SWE-bench MM | dataset | princeton-nlp/SWE-bench_Multimodal | |
| SWE-bench MM | split | dev | |
| GAIA | dataset | gaia-benchmark/GAIA | |
| GAIA | split | validation | |
| GAIA | level | 2023_all | |
| GAIA | num_workers | 30 | |
| GAIA | max_attempts | 3 | |

[CRITICAL ISSUES] - Must Fix

🔴 [benchmarks/gaia/config.py] Missing max_retries

GAIA's INFER_DEFAULTS is missing max_retries, but the evaluation repo specifies MAX_RETRIES: "3" for GAIA.

Impact: args_parser.py no longer has a default for max_retries. When gaia/build_images.py passes args.max_retries (which will be None) to build_all_images(), it will crash with:

```
assert max_retries >= 1, "max_retries must be at least 1"
TypeError: '>=' not supported between instances of 'NoneType' and 'int'
```

Fix Required:

```python
# benchmarks/gaia/config.py
INFER_DEFAULTS = {
    "dataset": "gaia-benchmark/GAIA",
    "split": "validation",
    "level": "2023_all",
    "num_workers": 30,
    "max_attempts": 3,
    "max_retries": 3,  # ADD THIS LINE
}
```

🔴 [Multiple build_images.py files] Build scripts don't set max_retries defaults

The following build scripts use args.max_retries but don't call parser.set_defaults() with max_retries:

  • swebench/build_images.py
  • swebenchmultimodal/build_images.py
  • gaia/build_images.py
  • commit0/build_images.py (only sets dataset, split, repo_split)
  • swtbench/build_eval_env_images.py (only sets dataset, split)

Impact: All build scripts will crash when max_retries=None is passed.

Fix Options:

  1. Add max_retries to each benchmark's INFER_DEFAULTS and use parser.set_defaults(**INFER_DEFAULTS) in build scripts (see the sketch after this list)
  2. Or restore a sensible default (e.g., 3) in args_parser.py for max_retries
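
A self-contained sketch of option 1 using the GAIA values; the argument set is deliberately reduced for illustration, and only the `parser.set_defaults(**INFER_DEFAULTS)` call reflects the actual proposal:

```python
# Sketch only: a build script applying the benchmark's INFER_DEFAULTS, which would
# then need to carry max_retries so the build path never sees None.
import argparse

INFER_DEFAULTS = {
    "dataset": "gaia-benchmark/GAIA",
    "split": "validation",
    "max_retries": 3,  # the key the review flags as missing from gaia's config
}

parser = argparse.ArgumentParser()
parser.add_argument("--dataset")
parser.add_argument("--split")
parser.add_argument("--max-retries", type=int)
parser.set_defaults(**INFER_DEFAULTS)
args = parser.parse_args([])

# With the defaults applied, the guard in the build path compares an int, not None:
assert args.max_retries >= 1, "max_retries must be at least 1"
```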

[IMPROVEMENT OPPORTUNITIES]

🟡 [benchmarks/utils/args_parser.py, Line 80-84] Inconsistent None handling

For run_infer.py scripts, max_retries=None is silently converted to 3 by Pydantic's EvalMetadata model default. For build_images.py scripts, it causes a crash. This inconsistency is confusing.

Suggestion: Either restore a global default in args_parser.py, or ensure all config.py files include max_retries.


VERDICT

Needs rework - The GAIA max_retries omission and build script defaults must be addressed before merging.

KEY INSIGHT: The PR correctly aligns inference defaults but overlooks that build scripts also inherit from args_parser.py and need max_retries to be set somewhere.


Logic Soundness Assessment

The overall architecture is sound:

  • ✅ Centralizing defaults in config.py per benchmark is a good pattern
  • ✅ Using parser.set_defaults() is the correct approach
  • ✅ Separating INFER_DEFAULTS and EVAL_DEFAULTS makes sense
  • ⚠️ The gap is that build scripts weren't updated to use the new pattern consistently


@simonrosenberg (Collaborator, Author) commented:

@juanmichelini commit0 was set to max_retries = max_attempts = 1 because commit0 runs are extremely long and it would always rerun 3 times (because back then the finish_with_patch critic was incorrectly marking everything as false).
Perhaps we should set both values to 3?

@simonrosenberg force-pushed the align-default-values-with-evaluation branch from f27ac88 to d7be118 on January 29, 2026 at 12:17
- Keep default=3 for max_attempts and max_retries in args_parser.py
- Remove redundant max_attempts=3 and max_retries=3 from config.py files
  (gaia, swebench, swebenchmultimodal, swtbench) since they match the default
- Keep max_attempts=1 and max_retries=1 in commit0/config.py since it differs
  from the default
- Remove max_retries from commit0/build_images.py set_defaults (uses global default)

Co-authored-by: openhands <openhands@all-hands.dev>
@simonrosenberg force-pushed the align-default-values-with-evaluation branch from d7be118 to 19be07f on January 29, 2026 at 12:26
@juanmichelini (Collaborator) commented:

> commit0 was set to max_retries = max_attempts = 1 because commit0 runs are extremely long and it would always rerun 3 times (because back then the finish_with_patch critic was incorrectly marking everything as false).
>
> Perhaps we should set both values to 3?

max_retries should always be 3 since it's there to mitigate infra errors. max_attempts, on the other hand, can be set to 1.

@juanmichelini (Collaborator) commented:

@simonrosenberg why does commit0 only have 8 workers?

@juanmichelini self-requested a review on January 29, 2026 at 12:58
@simonrosenberg (Collaborator, Author) commented:

> why does commit0 only have 8 workers?

It was set to 8 at the start because 16 seemed like too much to me. But 16 is actually nothing, so it should be bumped to 16.
